Fast parallel CRC algorithm and implementation on a configurable processor
Abstract
In this paper we present a fast cyclic redundancy check (CRC) algorithm that computes the CRC of a message of any length in parallel. Given a message of arbitrary length, we first split it into blocks, each of a fixed size equal to the degree of the generator polynomial. We then perform the CRC computation over the blocks in parallel using Galois Field multiplication and accumulation (GFMAC). In theory, our fast parallel CRC algorithm can achieve unlimited speedup over the bit-serial algorithm or the byte-wise table-lookup algorithm, at the expense of adding enough GFMAC units: given enough units, it can compute the CRC of a message of any length in 2 to 3 clock cycles. In practice, we use a configurable processor to which a customized instruction is added that performs multiple pairs of GF multiplications and accumulations. For example, a 4-GFMAC implementation can compute a 32-bit CRC for a 16-byte message in 2 to 3 cycles. This level of performance is hundreds of times faster than the bit-serial CRC algorithm and tens of times faster than the byte-wise parallel CRC algorithm. The generator polynomial can be either software-programmed or hard-coded. Our algorithm adds only a small number of logic gates to the processor core.
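The block-wise scheme described in the abstract can be sketched in software as follows. This is a minimal illustration, not the paper's implementation: it assumes the common CRC-32 generator 0x04C11DB7, a zero initial value, no bit reflection, no final XOR, and a message length that is a multiple of 4 bytes. The names `gf_mul`, `crc_blocks`, and `crc_bitserial` are hypothetical. Each block W_i is multiplied modulo the generator by a constant x^(32(i+1)) mod G, and the products are XORed together; because the products are independent, hardware could evaluate them in parallel, whereas this sketch evaluates them one after another.

```python
DEG = 32                      # degree of the generator polynomial (assumed)
POLY = 0x04C11DB7             # CRC-32 generator without its x^32 term (assumed)
MASK = (1 << DEG) - 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two polynomials of degree < DEG modulo the generator."""
    r = 0
    for _ in range(DEG):
        if b & 1:             # current bit of b selects a * x^j mod G
            r ^= a
        b >>= 1
        carry = a >> (DEG - 1)
        a = (a << 1) & MASK   # a = a * x ...
        if carry:
            a ^= POLY         # ... reduced modulo G
    return r


def crc_blocks(msg: bytes) -> int:
    """CRC as an XOR of independent multiply-accumulate terms, one per block."""
    if len(msg) % 4:
        raise ValueError("sketch assumes a whole number of 32-bit blocks")
    blocks = [int.from_bytes(msg[i:i + 4], "big") for i in range(0, len(msg), 4)]
    crc = 0
    beta = POLY               # x^32 mod G, the constant for the last block
    for w in reversed(blocks):
        crc ^= gf_mul(w, beta)        # one multiply-accumulate pair
        beta = gf_mul(beta, POLY)     # constant for the next block up
    return crc


def crc_bitserial(msg: bytes) -> int:
    """Reference bit-serial CRC with the same parameters, for comparison."""
    reg = 0
    for byte in msg:
        for k in range(7, -1, -1):
            bit = (byte >> k) & 1
            top = (reg >> (DEG - 1)) & 1
            reg = (reg << 1) & MASK
            if top ^ bit:
                reg ^= POLY
    return reg


if __name__ == "__main__":
    message = bytes(range(16))        # a 16-byte message, i.e. four blocks
    print(hex(crc_blocks(message)), hex(crc_bitserial(message)))
```

In this sketch the per-block constants are computed on the fly; a hardware or table-driven version would more plausibly precompute them once for a given generator and maximum message length.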
Similar resources
Fast Cellular Automata Implementation on Graphic Processor Unit (GPU) for Salt and Pepper Noise Removal
Noise removal is commonly applied as a pre-processing step before subsequent image processing tasks, because noise arises during acquisition or transmission. A common problem in imaging systems that use CMOS or CCD sensors is the appearance of salt-and-pepper noise. This paper presents Cellular Automata (CA) framework for noise removal of distorted image by the salt an...
Optimizing Matrix-matrix Multiplication for an Embedded VLIW Processor
The optimization of matrix-matrix multiplication (MMM) performance has been well studied on conventional general-purpose processors like the Intel Pentium 4. Fast algorithms, such as those in the Goto and ATLAS BLAS libraries, exploit common microarchitectural features including superscalar execution and the cache and TLB hierarchy to achieve near-peak performance. However, the microarchitectur...
A Low-Power High Throughput Configurable FFT/IFFT Processor for WLAN and WiMax Protocols
This paper presents a configurable Fast Fourier Transform (FFT) processor targeting the IEEE 802.11n (WLAN) and IEEE 802.16 (WiMax) wireless protocols. The processor is based on the Radix-2 Single-Path Delay Feedback (R2SDF) architecture and can be configured to operate on 64/128/512/1024/2048-point sequences. It was synthesized for a 90nm commercial standard-cells library by using Synops...
A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers
FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1][2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping...
An Effective Hybrid Genetic Algorithm for Hybrid Flow Shops with Sequence Dependent Setup Times and Processor Blocking
Hybrid flow-shop or flexible flow shop problems have remained subject of intensive research over several years. Hybrid flow-shop problems overcome one of the limitations of the classical flow-shop model by allowing parallel processors at each stage of task processing. In many papers the assumptions are generally made that there is unlimited storage available between stages and the setup times a...
Publication date: 2002